'BonTen' - Corpus Concordance System for 'NINJAL Web Japanese Corpus'
نویسندگان
چکیده
The National Institute for Japanese Language and Linguistics, Japan (NINJAL) has undertaken a corpus compilation project to construct a web corpus for linguistic research comprising 25 billion words. The project is divided into four parts: page collection, linguistic analysis, development of the corpus concordance system, and preservation. This article presents a corpus concordance system named ‘BonTen’, which enables a ten-billion-scaled corpus to be queried by string, a sequence of morphological information or a subtree of the syntactic dependency structure.
منابع مشابه
Concordance-Based Data-Driven Learning Activities and Learning English Phrasal Verbs in EFL Classrooms
In spite of the highly beneficial applications of corpus linguistics in language pedagogy, it has not found its way into mainstream EFL. The major reasons seem to be the teachers’ lack of training and the unavailability of resources, especially computers in language classes. Phrasal verbs have been shown to be a problematic area of learning English as a foreign language due to their semantic op...
متن کاملUtilizing lexical data from a web-derived corpus to expand productive collocation knowledge
Collocations are of great importance for second language learners, and a learner’s knowledge of them plays a key role in producing language fluently (Nation, 2001: 323). In this article we describe and evaluate an innovative system that uses a Web-derived corpus and digital library software to produce a vast concordance and present it in a way that helps students use collocations more effective...
متن کاملA Web Corpus and Word Sketches for Japanese
Of all the major world languages, Japanese is lagging behind in terms of publicly accessible and searchable corpora. In this paper we describe the development of JpWaC (Japanese Web as Corpus), a large corpus of 400 million words of Japanese web text, and its encoding for the Sketch Engine. The Sketch Engine is a web-based corpus query tool that supports fast concordancing, grammatical processi...
متن کاملCorpus Linguistics with BNCweb - a Practical Guide
Book synopsis This book presents a richly illustrated, hands-on discussion of one of the fastest growing fields in linguistics today. The authors address key methodological issues in corpus linguistics, such as collocations, keywords and the categorization of concordance lines. They show how these topics can be explored step-by-step with BNCweb, a user-friendly web-based tool that supports soph...
متن کاملپیکره اعلام: یک پیکره استاندارد واحدهای اسمی برای زبان فارسی
Named entity recognition (NER) is a natural language processing (NLP) problem that is mainly used for text summarization, data mining, data retrieval, question and answering, machine translation, and document classification systems. A NER system is tasked with determining the border of each named entity, recognizing its type and classifying it into predefined categories. The categories of named...
متن کامل